OpenVOC - Open Platform for Multilingual Vocabulary Training Integrating Speech Technology Components
Authors
Abstract
Language acquisition consists of several parts that cannot be learned independently of each other: grammar, pronunciation, vocabulary and others. Whereas grammar has to be learned interactively, led by a tutor or teacher to be efficient (e.g. by taking part in language training courses offered by language schools), vocabulary and its pronunciation can partly be learned autonomously. Language training should therefore cover all these parts with different kinds of practice material and support: dialogs simulating typical situations which appear in real life, the interactive construction of sentences to practice grammar, "listen and repeat" exercises to train understanding and pronunciation, and vocabulary drills for memorization. An "automatic teaching program" may appear inferior to a human language teacher, but it does have some potential advantages: it can be used for an unlimited amount of time, and students are less embarrassed than in a classroom. Although various language training programs are available, there is a lack of systems that use speech technology. This paper presents a framework for such a system, both to ease the evaluation of speech technology used in educational systems and to boost the application of speech synthesis and recognition in general. All core components for a learning application are implemented: a database for lesson and user data, synthesis support using a simple command line interface, and a platform-independent graphical user interface which runs on a wide range of operating systems. The whole system is distributed under a free license to minimize the barriers imposed on the user and to boost the use of speech technology in this field.

1. Current state of computer-assisted language teaching

1.1. Common approaches and existing systems

Several commercial training systems exist which use auditory output to present correct pronunciation and speech recognition to evaluate speech quality, intonation and accentuation.
They focus either on the learning of basic vocabulary or on specific topics like business expressions. Most of them can present prerecorded spoken versions of the lecture material; some can also evaluate a recorded version of a student's utterance. For a detailed discussion of the available programs and features see [1]. A number of educational systems have also been developed by universities worldwide. The ISLE project [2] aims at improving English as a second language for Italian and German learners. It uses an HMM-based recognizer trained on non-native speech to align student and reference utterances, and language-specific mispronunciation rules to detect mistakes by the speaker. The SRI EduSpeak system [3] uses HMMs specifically adapted to non-native speakers using Bayesian adaptation techniques. It was also shown [4] that training with non-native speech improves recognition rates, thus making it easier to judge pronunciation errors. A Japanese-only variant of a system using a synthetic or a natural reference and forced alignment was presented in [5]. It uses formant synthesis and can impose the correct prosody on the student's speech. Another English pronunciation system for Japanese students [6] models many common error patterns to detect erroneous phoneme segments. Other research is done in the Fluency project [7], the SPECO project [8] and the virtual language tutor [9]. Several free programs for language and vocabulary learning exist (KVocTrain, FlashKard, Langdrill, LingoTeach), though none of them is known to provide support for speech technology.

1.2. Evaluation and scientific issues

The fully automatic evaluation of the pronunciation of a non-native speaker is still an unsolved problem. Different ways of scoring pronunciation quality, using e.g. temporal measures, and whether they correlate with expert ratings are described in [10, 11, 12].
Approaches exist that compare the recorded speech with several HMM reference models available for the utterance, calculating distances to them based on e.g. spectral or segmental similarity. Sometimes models are trained on the erroneous patterns themselves to identify errors. Deriving and presenting reliable scores for the distinction between correct and wrong utterances has to take into account the user acceptance and self-confidence of the learner, the student's experience with pronunciation training, and the total amount of correction information that should be shown in order to actually improve pronunciation rather than cause confusion.

2. The OpenVOC approach

2.1. Targets and underlying speech technology

One of the major problems second language learners have to cope with is insufficient vocabulary knowledge. About 2,000 to 3,000 word families are needed to enable basic language use, whereas foreign students reading university texts need to have 10,000 to 11,000 word families at their disposal [13, 14]. The acquisition of such a large vocabulary can be supported by computer programs and the use of speech technology. The presented system concentrates on the efficient rehearsal of single words and phrases that can be defined by the user according to his or her needs. The task of learning vocabulary can be divided into different levels of knowledge that have to be acquired to enable everyday use. Mastering words means that one is able to read them, write them, understand their different meanings and speak them with correct pronunciation. The application therefore has to support all these steps of learning. It has to provide means to achieve more intense exposure to the foreign language, with the effect of learning new vocabulary in an efficient way. It should be possible to quickly define and rehearse larger amounts of new vocabulary.

2.2. Text-to-speech synthesis
In contrast to all known commercial training software, the program uses arbitrary external synthesis packages to provide the acoustic output. This enables the program to render spoken versions of all requested words and phrases that are supported by the synthesizer, making it independent of preexisting lecture material. Speech synthesis is often said to be not good enough for applications that require accurate pronunciation. In the opinion of the authors, currently available packages with carefully crafted voices provide enough quality to be used in educational software for single words and phrases. Although the quality of sentences and longer texts may suffer from the known effects of monotonous prosody and insufficient expressiveness, these constraints do not apply in the limited domain of a vocabulary training application whose learning material mainly consists of single words and short phrases. As a pronunciation reference and for teaching purposes, speech synthesis output is already used successfully in some on-line dictionaries, such as the LEO dictionary for translations from English and French to German [15]. The LexDRESS system [16] focuses on word-domain synthesis (for German) using a special allophone set and a word-based prosodic model. In [17], a bilingual synthesis approach (for Russian and German) is introduced which leads to consistent system resources for a dictionary. Independent of limited word-domain approaches, state-of-the-art corpus synthesizers achieve an intelligibility of almost 100% in a known domain and Mean Opinion Scores (MOS) of about 3.8 [18] (compared to a MOS of about 4.7 for natural speech). Consequently, these synthesizers can already be used for vocabulary training, at least as a first acoustic reference. Nevertheless, for qualified pronunciation teaching aimed at reducing foreign accent, the synthesis technology requires further improvement.

2.3. Synthesis support
Widely available speech synthesis packages include Mbrola [19] and Festival [20], which provide support for many different languages. The Festival package is provided under the revised BSD license; it offers English and Spanish voices and all functions of a complete TTS system. The Mbrola program is distributed under non-free license terms allowing non-commercial and non-military use only. It provides only the phone-to-wave conversion and can be combined with different text-to-phone converters to create a full text-to-speech system; it can be used for a wide range of languages. For support of German and Russian, a common TTS synthesis prototype is used [17].

Figure 1: Application structure

2.4. Automatic speech recognition

Although there are several speech synthesis packages with a large number of different voices available, very few speech recognition systems can be obtained freely. Worse, most of these systems have to be trained for a certain speaker and are very limited in the number of languages with which they can be used. To minimize the dependencies on the speech technology software used by the program, the method of recognition-by-synthesis is used to prepare the speech signal patterns to be evaluated. This only requires the same speech synthesis package as used before. To align the synthesized reference and the student utterance, Dynamic Time Warping (DTW) is used. Various characteristics of the reference and student signals are then calculated and compared.

2.5. Intellectual property issues

Nowadays speech technology is readily available in many different languages, for various operating systems, and under free or restricted license terms. The presented program makes it possible to use nearly all available speech synthesis packages that provide a command line or system library interface. This makes the system very versatile and allows easy integration of new or experimental synthesis packages as they become available.
The program is released under the General Public License (GPL) of the Free Software Foundation, granting the freedom to run the program for any purpose, to study how it works, and to change and redistribute it. The authors hope in this way to lower the barriers and costs normally associated with such training software and to foster community development and widespread use by not imposing unnecessary limitations.

3. System architecture

3.1. Application framework

The system is divided into session management and synthesis-related parts; the latter is also responsible for the platform-dependent audio interface. The program is written in object-oriented C++ and uses the GIMP Toolkit (GTK) for the user interface. Training material, program files and user data are kept separate, and individual settings are saved between sessions. The student's performance is monitored, and information on vocabulary history and difficulties is used to adapt the training process accordingly.

Figure 2: Corresponding XML data for the entry "Erdbeere {w.}" (English: strawberry; Russian: земляника; клубника)
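Figure 2 shows the XML representation of a single lesson entry; since the concrete schema is not reproduced in this text, the following is only a hypothetical sketch of how such a multilingual entry might be encoded. All element and attribute names are assumptions; only the entry data (Erdbeere {w.}, strawberry, земляника; клубника) comes from the figure.

```xml
<!-- Hypothetical lesson entry; element and attribute names are illustrative,
     not the actual OpenVOC schema. -->
<entry>
  <word lang="de" gender="f">Erdbeere</word>
  <word lang="en">strawberry</word>
  <word lang="ru">земляника; клубника</word>
</entry>
```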